ISO/IEC 2022

ISO/IEC 2022 Information technology—Character code structure and extension techniques, is an ISO standard (equivalent to the ECMA standard ECMA-35^[1] ) specifying

a technique for including multiple character sets in a single character encoding system, and
a technique for representing these character sets in both 7 and 8 bit systems using the same encoding.

Many of the character sets included as ISO/IEC 2022 encodings are 'double byte' encodings where two bytes correspond to a single character. This makes ISO-2022 a variable width encoding. But a specific implementation does not have to implement all of the standard; the conformance level and the supported character sets are defined by the implementation.

Introduction

Many languages or language families not based on the Latin alphabet such as Greek, Russian, Arabic, or Hebrew have historically been represented on computers with different 8-bit extended ASCII encodings. Written East Asian languages, specifically Chinese, Japanese, and Korean, use far more characters than can be represented in an 8-bit computer byte and were first represented on computers with language-specific double byte encodings.

ISO/IEC 2022 was developed as a technique to attack both of these problems: to represent characters in multiple character sets within a single character encoding, and to represent large character sets.

A second requirement of ISO-2022 was compatibility with 7-bit communication channels. Although ISO-2022 is an 8-bit code, any 8-bit sequence can be re-encoded to use only 7 bits without loss, normally with only a small increase in size.

To represent multiple character sets, the ISO/IEC 2022 character encodings include escape sequences which indicate the character set for the characters that follow. The escape sequences are registered with ISO and follow the patterns defined within the standard. These character encodings require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences. Note, however, that other standards such as ISO-2022-JP may impose extra conditions, such as requiring that the current character set be reset to US-ASCII before the end of a line.

To represent large character sets, ISO/IEC 2022 builds on ISO/IEC 646's property that a seven-bit code normally defines 94 graphic (printable) characters (in addition to space and 33 control characters). Using two bytes, it is thus possible to represent up to 8836 (94×94) characters; and, using three bytes, up to 830584 (94×94×94) characters. Though the standard defines it, no registered character set uses three bytes. For the two-byte character sets, the code point of each character is normally specified in so-called kuten (Japanese: 区点) form (sometimes called quwei (Chinese: 区位), especially when dealing with GB2312 and related standards), which specifies a zone (区, Japanese: ku, Chinese: qu), and the point (Japanese: 点 ten) or position (Chinese: 位 wei) of that character within the zone.
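The arithmetic and the zone/point form can be illustrated with a minimal Python sketch. It assumes the usual convention that zone and point numbers run from 1 to 94 and are each offset by 0x20 to give a byte in the range 0x21–0x7E; the function names are hypothetical and not taken from any standard.

```python
def kuten_to_bytes(ku: int, ten: int) -> bytes:
    """Map a 1-based (zone, point) pair to two 7-bit bytes in 0x21-0x7E."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("zone and point must each be in 1..94")
    return bytes([0x20 + ku, 0x20 + ten])


def bytes_to_kuten(pair: bytes) -> tuple:
    """Inverse mapping: two bytes in 0x21-0x7E back to (zone, point)."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected exactly two bytes in 0x21-0x7E")
    return pair[0] - 0x20, pair[1] - 0x20


# Sizes quoted in the text: 94 positions per byte.
assert 94 ** 2 == 8836
assert 94 ** 3 == 830584

# Zone 16, point 1 corresponds to the byte pair 0x30 0x21.
print(kuten_to_bytes(16, 1).hex())   # 3021
print(bytes_to_kuten(b"\x30\x21"))   # (16, 1)
```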

The escape sequences therefore declare not only which character set is being used but also, because the properties of each registered character set are known, whether a 94-, 96-, 8836-, or 830584-character (or some other sized) encoding is being dealt with.

In practice, the escape sequences declaring the national character sets may be absent if context or convention dictates that a certain national character set is to be used. For example, ISO-8859-1 states that no defining escape sequence is needed and RFC 1922, which defines ISO-2022-CN, allows ISO-2022 SHIFT characters to be used without explicit use of escape sequences.

The ISO-2022 definitions of the ISO-8859-X character sets are specific fixed combinations of the components that form ISO-2022. Specifically, the lower control characters (C0), the US-ASCII character set (in GL), and the upper control characters (C1) are standard, and the high characters (GR) are defined for each of the ISO-8859-X variants; for example, ISO-8859-1 is defined by the combination of ISO-IR-1, ISO-IR-6, ISO-IR-77 and ISO-IR-100, with no shifts or character changes allowed.

Although ISO/IEC 2022 character sets using control sequences are still in common use, particularly ISO-2022-JP, most modern e-mail applications are converting to the simpler Unicode transforms such as UTF-8. The encodings that do not use control sequences, such as the ISO-8859 sets, are still very common.

Code structure

ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated" into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked" to interpret bytes in the stream.

Character codes from the 7-bit ASCII graphic range (0x20–0x7F) are referred to as "GL" codes, being on the left side of a character code table, while codes from the "high ASCII" range (0xA0–0xFF), if available, are referred to as the "GR" codes.

By default, GL codes specify G0 characters, and GR codes specify G1 characters, but this may be modified with control codes or by prior agreement:

Code	Abbr.	Name	Effect
`0x0F`	SI LS0	Shift In Locking shift zero	GL encodes G0 from now on
`0x0E`	SO LS1	Shift Out Locking shift one	GL encodes G1 from now on
`ESC 0x6E` (n)	LS2	Locking shift two	GL encodes G2 from now on
`ESC 0x6F` (o)	LS3	Locking shift three	GL encodes G3 from now on
`0x8E` or `ESC 0x4E` (N)	SS2	Single shift two	GL encodes G2 for next character only
`0x8F` or `ESC 0x4F` (O)	SS3	Single shift three	GL encodes G3 for next character only
`ESC 0x7E` (~)	LS1R	Locking shift one right	GR encodes G1 from now on
`ESC 0x7D` (})	LS2R	Locking shift two right	GR encodes G2 from now on
`ESC 0x7C` (\|)	LS3R	Locking shift three right	GR encodes G3 from now on
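The table can be read as a small state machine. The following Python sketch tracks which working set each graphic byte is interpreted in. It is an illustration under simplifying assumptions, not a conforming implementation: it handles only the shift functions listed above, treats every graphic character as a single byte, applies a single shift to the next GL byte only (as the table states), and rejects any other escape sequence. The name `resolve_working_sets` is hypothetical.

```python
ESC = 0x1B
GL_LOCKING = {0x0F: 0, 0x0E: 1}                # SI/LS0, SO/LS1
ESC_GL_LOCKING = {0x6E: 2, 0x6F: 3}            # LS2, LS3
ESC_GR_LOCKING = {0x7E: 1, 0x7D: 2, 0x7C: 3}   # LS1R, LS2R, LS3R
ESC_SINGLE = {0x4E: 2, 0x4F: 3}                # SS2, SS3 (7-bit forms)
C1_SINGLE = {0x8E: 2, 0x8F: 3}                 # SS2, SS3 (8-bit forms)


def resolve_working_sets(data: bytes) -> list:
    """Return (byte, working_set_index) for each graphic byte in data."""
    gl, gr = 0, 1          # defaults: GL invokes G0, GR invokes G1
    single = None          # pending single shift, if any
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b == ESC:
            if i >= len(data):
                raise ValueError("truncated escape sequence")
            f = data[i]
            i += 1
            if f in ESC_GL_LOCKING:
                gl = ESC_GL_LOCKING[f]
            elif f in ESC_GR_LOCKING:
                gr = ESC_GR_LOCKING[f]
            elif f in ESC_SINGLE:
                single = ESC_SINGLE[f]
            else:
                raise ValueError(f"ESC 0x{f:02X} is outside this sketch")
        elif b in GL_LOCKING:
            gl = GL_LOCKING[b]
        elif b in C1_SINGLE:
            single = C1_SINGLE[b]
        elif 0x20 <= b <= 0x7F:                # GL code
            out.append((b, gl if single is None else single))
            single = None
        elif b >= 0xA0:                        # GR code
            out.append((b, gr))
        # Other C0/C1 control bytes are ignored here.
    return out


print(resolve_working_sets(b"A\x0eB\x0fC"))   # [(65, 0), (66, 1), (67, 0)]
```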

Each of the four working sets may be a 94-character set or a 94ⁿ-character set. Additionally, G1 through G3 may be a 96- or 96ⁿ-character set. When one of the latter is invoked in the GL region, the space and delete characters (codes 0x20 and 0x7F) are not available.

There are additional (rarely used) features for switching control character sets, but this is a single-level lookup: the 0x00–0x1F range is the C0 control character set, the 0x80–0x9F range is the C1 control character set, and there are escape sequences which switch in various alternatives. It is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible.

As seen in the SS2 and SS3 examples above, single control characters from the C1 control character set may be invoked using only 7 bits using the sequences ESC 0x40 (@) through ESC 0x5F (_). Additional control functions are assigned in the range ESC 0x60 (`) through ESC 0x7E (~). While this article describes escape sequences using the corresponding ASCII characters, they are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
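The correspondence between an 8-bit C1 control and its 7-bit escape form is a fixed offset of 0x40 (for example, 0x8E corresponds to ESC 0x4E). A minimal Python sketch of the conversion in both directions follows; the function names are hypothetical, and the sketch deliberately covers only C1 controls, leaving GR graphic bytes untouched (re-encoding those for a 7-bit channel would additionally require shift functions).

```python
ESC = 0x1B


def c1_to_7bit(data: bytes) -> bytes:
    """Replace each C1 byte 0x80-0x9F with ESC followed by (byte - 0x40)."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            out += bytes([ESC, b - 0x40])
        else:
            out.append(b)
    return bytes(out)


def c1_to_8bit(data: bytes) -> bytes:
    """Replace each ESC 0x40-0x5F pair with the single C1 byte (final + 0x40)."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data) and 0x40 <= data[i + 1] <= 0x5F:
            out.append(data[i + 1] + 0x40)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


print(c1_to_7bit(b"\x8eA"))   # b'\x1bNA'  (SS2 in its 7-bit form)
print(c1_to_8bit(b"\x1bOA"))  # b'\x8fA'   (SS3 in its 8-bit form)
```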

Escape sequences to designate character sets take the form ESC I [I...] F, where there are one or more intermediate I bytes from the range 0x20–0x2F, and a final F byte from the range 0x40–0x7E. (The range 0x30–0x3F is reserved for private-use F bytes.) The I bytes identify the type of character set and the working set it is to be designated to, while the F byte identifies the character set itself.

Code	Hex	Abbr.	Name	Effect
`ESC ! F`	`1B 21 F`	CZD	C0-designate	F selects a C0 control character set to be used.
`ESC " F`	`1B 22 F`	C1D	C1-designate	F selects a C1 control character set to be used.
`ESC % F`	`1B 25 F`	DOCS	Designate other coding system	F selects an 8-bit code; use `ESC % @` to return to ISO/IEC 2022.
`ESC % / F`	`1B 25 2F F`	DOCS	Designate other coding system	F selects an 8-bit code; there is no standard way to return.
`ESC ( F`	`1B 28 F`	GZD4	G0-designate 94-set	F selects a 94-character set to be used for G0.
`ESC ) F`	`1B 29 F`	G1D4	G1-designate 94-set	F selects a 94-character set to be used for G1.
`ESC * F`	`1B 2A F`	G2D4	G2-designate 94-set	F selects a 94-character set to be used for G2.
`ESC + F`	`1B 2B F`	G3D4	G3-designate 94-set	F selects a 94-character set to be used for G3.
`ESC - F`	`1B 2D F`	G1D6	G1-designate 96-set	F selects a 96-character set to be used for G1.
`ESC . F`	`1B 2E F`	G2D6	G2-designate 96-set	F selects a 96-character set to be used for G2.
`ESC / F`	`1B 2F F`	G3D6	G3-designate 96-set	F selects a 96-character set to be used for G3.
`ESC $ ( F`	`1B 24 28 F`	GZDM4	G0-designate multibyte 94-set	F selects a 94ⁿ-character set to be used for G0.
`ESC $ ) F`	`1B 24 29 F`	G1DM4	G1-designate multibyte 94-set	F selects a 94ⁿ-character set to be used for G1.
`ESC $ * F`	`1B 24 2A F`	G2DM4	G2-designate multibyte 94-set	F selects a 94ⁿ-character set to be used for G2.
`ESC $ + F`	`1B 24 2B F`	G3DM4	G3-designate multibyte 94-set	F selects a 94ⁿ-character set to be used for G3.
`ESC $ - F`	`1B 24 2D F`	G1DM6	G1-designate multibyte 96-set	F selects a 96ⁿ-character set to be used for G1.
`ESC $ . F`	`1B 24 2E F`	G2DM6	G2-designate multibyte 96-set	F selects a 96ⁿ-character set to be used for G2.
`ESC $ / F`	`1B 24 2F F`	G3DM6	G3-designate multibyte 96-set	F selects a 96ⁿ-character set to be used for G3.

Note that the registry of F bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related in any way to the 96-character set designated by ESC - A through ESC / A. And neither of those is related to the 94ⁿ-character set designated by ESC $ ( A through ESC $ + A, and so on; the final bytes must be interpreted in context. (Indeed, without any intermediate bytes, ESC A is a way of specifying the C1 control code 0x81.)

Also note that C0 and C1 control character sets are independent; the C0 control character set designated by ESC ! A (which happens to be the NATS control set for newspaper text transmission) is not the same as the C1 control character set designated by ESC " A (the CCITT attribute control set for Videotex).

Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of the form ESC ( ! F have been assigned. At the other extreme, no multibyte 96-sets have been registered, so the sequences above are strictly theoretical.
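As an illustration of how the intermediate bytes carry the target working set and set size, here is a hedged Python sketch that classifies a graphic-set designation sequence. It covers only the G0 to G3 rows of the table above; CZD, C1D, DOCS, and sequences with additional I bytes (such as ESC ( ! F) are outside its scope. The `Designation` type and `parse_designation` function are hypothetical names.

```python
from typing import NamedTuple


class Designation(NamedTuple):
    target: str       # "G0" .. "G3"
    size: int         # 94 or 96 (per byte)
    multibyte: bool   # True for the ESC $ ... forms
    final: str        # the F byte; its meaning depends on the other fields


TARGETS_94 = {0x28: "G0", 0x29: "G1", 0x2A: "G2", 0x2B: "G3"}   # ( ) * +
TARGETS_96 = {0x2D: "G1", 0x2E: "G2", 0x2F: "G3"}               # - . /


def parse_designation(seq: bytes) -> Designation:
    """Classify ESC I F or ESC $ I F, where I selects the working set."""
    if len(seq) < 3 or seq[0] != 0x1B:
        raise ValueError("not a designation escape sequence")
    body = seq[1:]
    multibyte = body[0] == 0x24          # '$' marks a multibyte set
    if multibyte:
        body = body[1:]
    if len(body) != 2:
        raise ValueError("expected one set-selecting I byte and one F byte")
    i_byte, f_byte = body
    if f_byte < 0x30:
        raise ValueError("additional I bytes are not handled in this sketch")
    if i_byte in TARGETS_94:
        return Designation(TARGETS_94[i_byte], 94, multibyte, chr(f_byte))
    if i_byte in TARGETS_96:
        return Designation(TARGETS_96[i_byte], 96, multibyte, chr(f_byte))
    raise ValueError("not a graphic-set designation")


print(parse_designation(b"\x1b(A"))
print(parse_designation(b"\x1b-A"))
```

The two calls share the final byte A but yield different `size` values, which mirrors the point that the final byte must be interpreted in context.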

ISO/IEC 2022 character sets

Character encodings using ISO/IEC 2022 mechanism include:

ISO-2022-JP. A widely used encoding for Japanese. Starts in ASCII and includes the following escape sequences
- ESC ( B to switch to ASCII (1 byte per character)
- ESC ( J to switch to JIS X 0201-1976 (ISO/IEC 646:JP) Roman set (1 byte per character)
- ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
- ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)
ISO-2022-JP-1. The same as ISO-2022-JP with one additional escape sequence
- ESC $ ( D to switch to JIS X 0212-1990 (2 bytes per character)
ISO-2022-JP-2. A multilingual extension of ISO-2022-JP. The same as ISO-2022-JP-1 with the following additional escape sequences [1]
- ESC $ A to switch to GB 2312-1980 (2 bytes per character)
- ESC $ ( C to switch to KS X 1001-1992 (2 bytes per character)
- ESC . A to switch to ISO/IEC 8859-1 high part, Extended Latin 1 set (1 byte per character) [designated to G2]
- ESC . F to switch to ISO/IEC 8859-7 high part, Basic Greek set (1 byte per character) [designated to G2]
ISO-2022-JP-3. The same as ISO-2022-JP with three additional escape sequences
- ESC ( I to switch to JIS X 0201-1976 Kana set (1 byte per character)
- ESC $ ( O to switch to JIS X 0213-2000 Plane 1 (2 bytes per character)
- ESC $ ( P to switch to JIS X 0213-2000 Plane 2 (2 bytes per character)
ISO-2022-JP-2004. The same as ISO-2022-JP-3 with one additional escape sequence
- ESC $ ( Q to switch to JIS X 0213-2004 Plane 1 (2 bytes per character)
ISO-2022-KR. An encoding for Korean.
- ESC $ ) C to switch to KS X 1001-1992,^[2]^[3] previously named KS C 5601-1987 (2 bytes per character) [designated to G1]
ISO-2022-CN. An encoding for Chinese.
- ESC $ ) A to switch to GB 2312-1980 (2 bytes per character) [designated to G1]
- ESC $ ) G to switch to CNS 11643-1992 Plane 1 (2 bytes per character) [designated to G1]
- ESC $ * H to switch to CNS 11643-1992 Plane 2 (2 bytes per character)
ISO-2022-CN-EXT. The same as ISO-2022-CN with six additional escape sequences
- ESC $ ) E to switch to ISO-IR-165 (2 bytes per character) [designated to G1]
- ESC $ + I to switch to CNS 11643-1992 Plane 3 (2 bytes per character) [designated to G3]
- ESC $ + J to switch to CNS 11643-1992 Plane 4 (2 bytes per character) [designated to G3]
- ESC $ + K to switch to CNS 11643-1992 Plane 5 (2 bytes per character) [designated to G3]
- ESC $ + L to switch to CNS 11643-1992 Plane 6 (2 bytes per character) [designated to G3]
- ESC $ + M to switch to CNS 11643-1992 Plane 7 (2 bytes per character) [designated to G3]

The character after the ESC (for single-byte character sets) or ESC $ (for multi-byte character sets) specifies the type of character set and working set that is designated to. In the above examples, the character ( (0x28) designates a 94-character set to the G0 character set. This may be replaced by ), * or + (0x29–0x2B) to designate to the G1–G3 character sets.

Two of the codes above (ESC . A and ESC . F) are 96-character codes; in those examples, the character . (0x2E) designates to the G2 character set. This may be replaced with - or / (0x2D or 0x2F) to designate to the G1 or G3 character sets. As mentioned earlier, a 96-character set may not be designated to the G0 set.

There are three special cases for multi-byte codes. The code sequences ESC $ @, ESC $ A, and ESC $ B were all registered before the ISO/IEC 2022 standard was finalized, so must be accepted as synonyms for the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set. The latter form may also be used, and may be adapted by changing the ( character to designate to the G1 through G3 character sets.
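The short form ESC $ B can be observed with the `iso2022_jp` codec that ships with Python's standard library. The snippet below is only a demonstration of that one codec's behaviour, not a statement about other implementations; the sample text is an arbitrary choice.

```python
text = "A\u4e9c"   # 'A' followed by the kanji at JIS X 0208 zone 16, point 1

encoded = text.encode("iso2022_jp")
print(encoded)     # b'A\x1b$B0!\x1b(B'

# ESC $ B switches to JIS X 0208-1983, the two bytes 0x30 0x21 encode the
# kanji, and ESC ( B switches back to ASCII at the end of the data.
assert encoded.decode("iso2022_jp") == text
```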

The standard also defines a way to specify coding systems that do not follow its own structure. Of particular interest, the sequence ESC % G designates the UTF-8 coding system, which does not reserve the range 0x80–0x9F for control characters.

Comparison with other encodings

Advantages

ISO/IEC 2022 is one way to represent a large set of characters in a system limited to 7-bit encodings. In practice, support for 7-bit channels is an advantage only for backwards compatibility with older systems, since the vast majority of modern computers use 8 bits for each byte.

Disadvantages

Since ISO/IEC 2022 is a stateful encoding, a program cannot jump into the middle of a block of text to search, insert or delete characters. This makes manipulation of the text cumbersome and slow compared with non-stateful encodings. Any jump into the middle of the text may require backing up to the previous escape sequence before the bytes that follow it can be interpreted.
Since characters can be represented in multiple ways in ISO/IEC 2022 due to its stateful nature, two visually identical strings cannot be reliably compared for equality. The same character may be reached through single shifts, through locking shifts, or from more than one character set.
Some systems, like DICOM and several e-mail clients, use a variant of ISO-2022 in addition to supporting several other encodings.^[4] This type of variation makes it difficult to portably transfer text between computer systems.
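The first two disadvantages can be demonstrated with Python's standard `iso2022_jp` codec. This is a small sketch using arbitrary sample bytes; the behaviour shown is that of this particular codec.

```python
# 1. Different byte strings, identical text: a redundant designation of
#    ASCII to G0 changes the bytes but not the decoded result.
plain = b"abc"
redundant = b"\x1b(Babc"
assert plain != redundant
assert plain.decode("iso2022_jp") == redundant.decode("iso2022_jp") == "abc"

# 2. Identical bytes, different text: the pair 0x30 0x21 is "0!" while
#    ASCII is designated, but a single kanji after ESC $ B.
assert b"0!".decode("iso2022_jp") == "0!"
assert b"\x1b$B0!\x1b(B".decode("iso2022_jp") == "\u4e9c"
```

Comparing or searching such data byte-for-byte is therefore unreliable; decoding to a stateless form first is the usual workaround.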

References

Lunde, Ken. CJKV Information Processing. Cambridge, Massachusetts: O'Reilly & Associates, 1998. ISBN 1-56592-224-7.

External links

ISO/IEC 2022:1994
ISO/IEC 2022:1994/Cor 1:1999
ECMA-35, equivalent to ISO/IEC 2022 and freely downloadable.
International Register of Coded Character Sets to be Used with Escape Sequences, a full list of assigned character sets and their escape sequences
History of Character Codes in North America, Europe, and East Asia
CJK.INF: a document on encoding Chinese, Japanese, and Korean (CJK) languages, including a discussion of the various variants of ISO/IEC 2022. Also available by HTTP.

RFCs

RFC 1468: description of ISO-2022-JP
RFC 2237: description of ISO-2022-JP-1
RFC 1554: description of ISO-2022-JP-2
RFC 1922: description of ISO-2022-CN and ISO-2022-CN-EXT
RFC 1557: description of ISO-2022-KR
